Modeling steps:
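NA’s?
library(tidyverse)
library(tidymodels)

# Read the data
ins <- read_csv("https://www.dropbox.com/s/bocjjyo1ehr5auz/insurance.csv?dl=1")
# Make the response a factor and drop rows with missing values
ins <- ins %>%
  mutate(
    smoker = factor(smoker)
  ) %>%
  drop_na()
# Model specification: 5-nearest-neighbors classifier
knn_mod <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")
# Recipe: response and predictors
knn_recipe <- recipe(smoker ~ age + bmi + charges,
                     data = ins)
# Workflow: recipe + model
knn_wflow <- workflow() %>%
  add_recipe(knn_recipe) %>%
  add_model(knn_mod)
# 5-fold cross-validation splits
cvs <- vfold_cv(ins, v = 5)
# Fit the workflow on each fold
knn_fit <- knn_wflow %>%
  fit_resamples(cvs)
What percent of our guesses were correct?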
The problem: Consider this data.
[1] "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
[19] "B" "B" "B" "B" "B" "A" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
[37] "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
[55] "A" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
[73] "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
[91] "B" "B" "B" "B" "B" "B" "B" "B" "B" "B"
If I guess “B” every time, I’ll have 98% accuracy!
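To make the arithmetic behind that 98% concrete, here is a minimal base R sketch. The names `truth` and `guess` are made up for illustration; `truth` rebuilds the 100 labels printed above, with the two A's at positions 24 and 55.

```r
# Rebuild the 100 labels shown above: all "B" except positions 24 and 55.
truth <- rep("B", 100)
truth[c(24, 55)] <- "A"

# A "model" that ignores the data and always guesses "B".
guess <- rep("B", 100)

# Accuracy = share of guesses that match the truth.
accuracy <- mean(guess == truth)
accuracy
```

The accuracy looks great even though this guesser never finds a single A.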
ROC = “receiver operating characteristic” (ew)
False Positive Rate = (how many A’s did we say were B)/(how many A’s are there total)
How many of the A’s did we misclassify as B?
True Positive Rate = (how many B’s did we say were B)/(how many B’s are there total)
How many of the true B’s did we catch?
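As a rough check, the sketch below computes both rates for the always-“B” guesser on the 100 labels shown earlier. It assumes “B” is the target category and uses the standard definitions TPR = TP/(TP + FN) and FPR = FP/(FP + TN); the variable names are illustrative only.

```r
# The 100 labels from earlier: all "B" except positions 24 and 55.
truth <- rep("B", 100)
truth[c(24, 55)] <- "A"
guess <- rep("B", 100)  # always guess "B"

# Confusion-table counts, treating "B" as the target (positive) category.
tp <- sum(guess == "B" & truth == "B")  # B's we said were B
fn <- sum(guess == "A" & truth == "B")  # B's we said were A
fp <- sum(guess == "B" & truth == "A")  # A's we said were B
tn <- sum(guess == "A" & truth == "A")  # A's we said were A

tpr <- tp / (tp + fn)  # share of true B's we caught
fpr <- fp / (fp + tn)  # share of true A's we mislabeled as B
c(TPR = tpr, FPR = fpr)
```

Both rates come out as 1: we catch every B, but only because we also mislabel every A. Accuracy alone hides that.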
The ROC curve plots TPR against FPR across many decision boundaries (cutoffs).
First, find the probability the model assigns to the target category for each observation. (You get to decide which category is the target.)
If we choose a cutoff of 0.5, what are our TPR and FPR?
If we choose a cutoff of 0.8, what are our TPR and FPR?
If we choose a cutoff of 0.2, what are our TPR and FPR?
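The three questions above can be worked through in a small base R sketch. The probabilities and labels here are invented for illustration (they are not from the insurance data), and `rates_at()` is a hypothetical helper, not a library function.

```r
# Hypothetical predicted probabilities of the target category ("yes")
# and the matching true labels; invented numbers for illustration.
prob_yes <- c(0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.25, 0.10)
truth    <- c("yes", "yes", "no", "yes", "yes", "no", "no", "no")

# Guess "yes" whenever the probability reaches the cutoff,
# then compute TPR and FPR for that cutoff.
rates_at <- function(cutoff) {
  guess <- ifelse(prob_yes >= cutoff, "yes", "no")
  tpr <- sum(guess == "yes" & truth == "yes") / sum(truth == "yes")
  fpr <- sum(guess == "yes" & truth == "no") / sum(truth == "no")
  c(cutoff = cutoff, TPR = tpr, FPR = fpr)
}

# One row per cutoff.
rates <- t(sapply(c(0.2, 0.5, 0.8), rates_at))
rates
```

With these made-up numbers, lowering the cutoff raises the TPR but also raises the FPR. Each cutoff gives one point on the ROC curve.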
GOOD: The ROC curve is way above the diagonal line (we can achieve a really good TP rate without a high FP rate)
BAD: The ROC curve is on the diagonal line (every gain in TP rate costs an equal rise in FP rate)
ROC-AUC is the area under the curve - large values are good!
1 = I can predict perfectly: some cutoff separates the classes with no mistakes (for example, when all my predicted probs are 0% or 100%).
0.5 = I predict just as well as random guessing. If I guess more “yes”, I get more of the “no” wrong.
Below 0.5 = Yikes.
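One way to read ROC-AUC: it equals the chance that a randomly chosen true “yes” gets a higher predicted probability than a randomly chosen true “no” (ties count as half). The base R sketch below computes it that way on invented numbers; in practice a package function would normally do this for you.

```r
# Hypothetical predicted probabilities of "yes" and the true labels;
# invented numbers for illustration.
prob_yes <- c(0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.25, 0.10)
truth    <- c("yes", "yes", "no", "yes", "yes", "no", "no", "no")

pos <- prob_yes[truth == "yes"]  # probabilities given to true "yes"
neg <- prob_yes[truth == "no"]   # probabilities given to true "no"

# Compare every ("yes", "no") pair: 1 if the "yes" ranks higher,
# 0.5 for a tie, 0 otherwise. The average over all pairs is the AUC.
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc <- mean(wins)
auc
```

Here 14 of the 16 pairs are ordered correctly, so the AUC is 0.875: well above 0.5, but short of a perfect 1.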
Sensitivity
= (correctly guessed Category A)/(all actual Category A’s)
= TP/(TP + FN)
Specificity
= (correctly guessed Category B)/(all actual Category B’s)
= TN/(TN + FP)
Recall
= (correctly guessed Category A)/(all actual Category A’s)
= TP/(TP + FN)
= Sensitivity!
Precision
= (correctly guessed Category A)/(all guessed Category A)
= TP/(TP + FP)
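The ratios above are the standard sensitivity (recall), specificity, and precision formulas. A minimal base R sketch, using made-up confusion-table counts with Category A as the target:

```r
# Hypothetical counts (invented for illustration), Category A = target.
tp <- 40  # actual A, guessed A
fn <- 10  # actual A, guessed B
fp <- 20  # actual B, guessed A
tn <- 30  # actual B, guessed B

sensitivity <- tp / (tp + fn)  # same thing as recall
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)

c(sensitivity = sensitivity,
  specificity = specificity,
  precision   = precision)
```

Sensitivity and specificity divide by the actual totals; precision divides by how many A's we guessed, so it can be low even when sensitivity is high.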